PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds
Abstract
MOTIVATION: The explosion of next-generation sequencing data has spawned the design of new algorithms and software tools to provide efficient mapping for different read lengths and sequencing technologies. In particular, ABI's sequencer (SOLiD system) poses a big computational challenge with its capacity to produce very large amounts of data and its unique strategy of encoding sequence data into color signals.
RESULTS: We present PerM (Periodic Seed Mapping), mapping software that uses periodic spaced seeds to significantly improve mapping efficiency for large reference genomes compared with state-of-the-art programs. The data structure in PerM requires only 4.5 bytes per base to index the human genome, allowing entire genomes to be loaded into memory while multiple processors simultaneously map reads to the reference. Weight-maximized periodic seeds offer full sensitivity for up to three mismatches and high sensitivity for four and five mismatches while minimizing the number of random hits per query, which significantly reduces running time. Such sensitivity makes PerM a valuable mapping tool for SOLiD and Solexa reads.
AVAILABILITY: http://code.google.com/p/perm/
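To make the notion of "full sensitivity" concrete, the following is a minimal brute-force sketch in Python. It is not PerM's implementation and does not use PerM's actual seed patterns. The function names (`periodic_seed`, `seed_hits`, `is_fully_sensitive`), the '1'/'0' pattern notation (1 = position must match, 0 = don't care), and the toy patterns are assumptions chosen for illustration. A seed is treated as fully sensitive to k mismatches if, for every placement of k mismatches in a read, the seed can be positioned at some offset so that none of its "care" positions falls on a mismatch.

```python
from itertools import combinations


def periodic_seed(period: str, span: int) -> str:
    """Repeat a '1'/'0' period pattern and truncate it to `span` positions.

    Hypothetical helper: illustrates what a periodic pattern looks like,
    not how PerM constructs its seeds.
    """
    repeats = span // len(period) + 1
    return (period * repeats)[:span]


def seed_hits(seed: str, read_len: int, mismatches: set) -> bool:
    """Return True if the seed, at some offset in the read, avoids all mismatches."""
    care = [j for j, c in enumerate(seed) if c == "1"]
    for offset in range(read_len - len(seed) + 1):
        if all(offset + j not in mismatches for j in care):
            return True
    return False


def is_fully_sensitive(seed: str, read_len: int, k: int) -> bool:
    """Brute-force check over every placement of exactly k mismatches.

    If every k-mismatch placement is hit, placements with fewer mismatches
    are hit as well, so this also covers "up to k" mismatches.
    """
    return all(
        seed_hits(seed, read_len, set(positions))
        for positions in combinations(range(read_len), k)
    )


if __name__ == "__main__":
    print(periodic_seed("110", 7))
    print(is_fully_sensitive("1101", 6, 1))
    print(is_fully_sensitive("1101", 6, 2))
```

The enumeration grows combinatorially with read length and mismatch count, so this sketch only suits toy sizes. It illustrates the definition rather than a practical way to design seeds for real read lengths.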
Similar resources
Seed-Set Construction by Equi-entropy Partitioning for Efficient and Sensitive Short-Read Mapping
Spaced seeds have been shown to be superior to continuous seeds for efficient and sensitive homology search based on the seed-and-extend paradigm. Much the same is true in genome mapping of high-throughput short-read data. However, a highly sensitive search with multiple spaced patterns often requires a large amount of index data. We propose a novel seed-set construction method for ef...
Supplementary Methods Table 2. The Maximum Weight for Single and Paired Seeds Period at SOLiD Specific Sensitivity Levels Full Sensitive to 1 Base + 1 Color Substitutions 2 Base Substitutions Period Length
Although much research [1] [2] [3] [4] [5] [6] [7] has been devoted to the optimization of multiple spaced seeds for different sensitivity criteria, we propose the following three methods to generate fully sensitive periodic multiple seeds. For large-genome resequencing applications, multiple index tables can be queried with the MapReduce framework, as proposed in [8], to increase the mapping effi...
WALT: fast and accurate read mapping for bisulfite sequencing
Whole-genome bisulfite sequencing (WGBS) has emerged as the gold-standard technique in genome-scale studies of DNA methylation. Mapping reads from WGBS requires unique considerations that make the process more time-consuming than in other sequencing applications. Typical WGBS data sets contain several hundred million reads, adding to this analysis challenge. We present the WALT tool for mapping...
ZOOM! Zillions of oligos mapped
MOTIVATION The next generation sequencing technologies are generating billions of short reads daily. Resequencing and personalized medicine need much faster software to map these deep sequencing reads to a reference genome, to identify SNPs or rare transcripts. RESULTS We present a framework for how full sensitivity mapping can be done in the most efficient way, via spaced seeds. Using the fr...
Algorithms and tools for the analysis of high throughput DNA sequencing data
High-throughput DNA sequencing technologies make it possible to determine the order of the nucleotides adenine, cytosine, guanine and thymine in DNA samples, resulting in millions of short strings (reads) over the alphabet (A, C, G, T). Advances in biological and biomedical research rely on the ability of bioinformatics to make sense out of that data with novel algorithms and tools. In this the...